In [1]:
import airbnb
from IPython.display import Image, HTML
HTML('''<script>
code_show=true;
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

Where to Next? Clustering AirBnB Listings in Melbourne, Australia

In [2]:
Image("Melbourne.jpg")
Out[2]:

Executive Summary

Melbourne, the capital of Victoria, Australia, blends world-class restaurants, museums, art, and unique experiences, making it a go-to place for tourists. It also houses the main government offices and large universities. In recent years, short-term rental listings such as those on AirBnB have become very popular in Melbourne. The added long-term stay feature allows lessees to rent a property for a longer term, which captures another market segment: those looking for an affordable place to stay for a longer period. For prospective hosts and tourists/lessees alike, it is helpful to understand how listings vary when comparing existing listings. One difficulty, however, is the large number of parameters or features involved in comparing these listings. This is where dimensionality reduction and clustering can be used.

This study used information from AirBnB listings to identify which features (both intrinsic and extrinsic) contribute the most to listing diversity. Correlations among features were also determined. Through exploratory analysis and dimensionality reduction, we found that amenities related to long-term stays are the greatest contributor to listing diversity. This suggests that most hosts are gearing towards allowing their renters to stay for longer periods.

Price is the main distinguishing feature that segregates listings into clusters under agglomerative clustering, while all other features are more or less the same among clusters. Renters looking for either short- or long-term stays can therefore choose, among clusters with the same characteristics, the one with the lowest price to make the most of their budget.

Although the 10-cluster model resulting from agglomerative clustering can distinguish clusters based on price of listings, a more granular study can be done by analyzing the subclusters within each cluster. This may give more insights as to the possible features that can further differentiate the clusters.

Introduction

Background and Significance of the Study

Airbnb is a platform where people can list and book accommodations around the world. In recent years, Airbnb has been reshaping how people find and rent short-term accommodation. Melbourne was one of the early adopters of Airbnb and is one of the top 10 cities for global travelers on the platform. With over 18,000 listings as of July 2021, Airbnb has a strong presence in Melbourne's short-term accommodation market, with a diverse range of property types, amenities, and hosts. All this information is easily accessible through the app. Hosts can communicate with potential renters and vice versa. Both hosts and renters have ratings so that everyone can expect a certain level of quality.

This study aims to determine which features cluster together to discover patterns and similarities in Airbnb listings. To achieve this, we used multiple clustering methods to determine which listings cluster together and how they cluster. This would be beneficial to customers because users can simply filter the listings based on specific features of their choosing rather than browsing through countless Airbnb listings.

Problem Statement

Problem Statement: What are the features, both intrinsic and extrinsic, that contribute to clustering of AirBnB listings in Melbourne, Australia?

Secondary Research Questions:

  • Are there specific patterns on how listings are clustered together?
  • In which specific locations/neighbourhoods do most listings cluster together?
  • Can we determine which characteristics make a listing unique?
  • What features or characteristics make listings similar?

Methodology

Shown below is the general workflow that was followed in the study (Figure 1).

First, data was retrieved from insideairbnb.com. The dataset was inspected for pre-processing. Unnecessary columns were removed. Missing values were handled either by imputation (less than 25% missing values) or by dropping the columns (more than or equal to 25% missing values). Similar but inconsistent values were corrected using regex.

After data cleaning, exploration of data was done to gather important insights. In preparation for dimensionality reduction, feature engineering was done. A binary representation was created for list-based columns such as amenities and host verifications. Finally, min-max scaling was performed.

For dimensionality reduction, truncated SVD was the chosen technique since the dataset is a mixture of dense and sparse data. Projections of the features were plotted along the singular vectors with the highest contributions to listing variation. For clustering, we explored three clustering methods—agglomerative, Kmeans, and Kmedians. We chose not to include Kmedians in our results because we did not get any significant insights. Results were analyzed to obtain insights that may be beneficial both for the hosts and those who plan to book an Airbnb in Melbourne.

In [3]:
Image("Methodology_updated.png")
Out[3]:
In [6]:
airbnb.fig_caption('General workflow for the Study', 1)

Figure 1. General workflow for the Study.

Data

Source

Data for this study was retrieved from insideairbnb.com, an independent, non-commercial set of tools and data for exploring how Airbnb is really being used in cities around the world. The file listings.csv.gz was used for this study. It contains information about each listing, its host, and its availability.

Information on Melbourne municipalities and neighbourhoods included in each municipality was obtained here.

Dictionary

Descriptions of the variable names in the dataset can be found in the Data Dictionary in Table 1.

In [7]:
airbnb.table_caption('Data Dictionary for the AirBnB Dataset '
                     'and Melbourne Municipality File', 1)
Table 1. Data Dictionary for the AirBnB Dataset and Melbourne Municipality File.

File Name: listings.csv.gz

Column Type Description
id INTEGER Airbnb's unique identifier for the listing
host_is_superhost BOOLEAN Host is a superhost
host_listings_count INTEGER Number of listings the host has
host_verifications TEXT Verified channels of the host
neighbourhood_cleansed TEXT Neighbourhood geocoded using latitude and longitude
latitude INTEGER World Geodetic System projection
longitude INTEGER World Geodetic System projection
property_type TEXT Type of property of the listing
room_type TEXT Type of room of the listing
accommodates INTEGER Maximum capacity of listing
bedrooms INTEGER Number of bedrooms
beds INTEGER Number of beds
amenities TEXT List of amenities
price INTEGER Daily price in local currency
minimum_nights INTEGER minimum number of night stay for the listing
has_availability BOOLEAN If listing is available
availability_30 INTEGER Availability of the listing 30 days in the future
availability_60 INTEGER Availability of the listing 60 days in the future
availability_90 INTEGER Availability of the listing 90 days in the future
availability_365 INTEGER Availability of the listing 365 days in the future
number_of_reviews INTEGER Number of reviews the listing has
number_of_reviews_ltm INTEGER Number of reviews the listing has in the last 12 months
review_scores_rating INTEGER Average review score of listing
calculated_host_listings_count INTEGER Listings the host has in the current scrape
reviews_per_month INTEGER Number of reviews per month the listing has
number_of_baths INTEGER Number of baths the listing has
bath_type TEXT Type of bath the listing has

File Name: melbourne_municipalities.csv

Column Type Description
municipality TEXT Different municipalities under Melbourne
neighbourhood_cleansed TEXT Neighbourhood names in each municipality

Data Cleaning

In [8]:
# import libraries
import pandas as pd
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

The dataset was loaded as a dataframe with 18605 rows/listings and 74 features. Data cleaning was done to ensure consistency of the formats of entries in some columns. Unnecessary columns were also removed. After cleaning and filtering to include only the main property types (apartment, house, townhouse, condominium, and guesthouse), the resulting dataframe had 16959 unique listings.
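As a rough illustration of how free-text property types might be collapsed into the five main categories with a regular expression, the sketch below uses made-up labels and a hypothetical helper named `normalize_property_type`. It is an assumption about the kind of cleaning involved, not the actual logic inside `airbnb.clean_data`.

```python
import re

import pandas as pd

# Hypothetical raw labels; the real column holds free-text property types.
raw = pd.Series([
    "Entire apartment",
    "Private room in house",
    "Entire townhouse",
    "Entire condominium",
    "Private room in guesthouse",
    "Boat",
])

# The five main property types named in the text.
MAIN_TYPES = ["apartment", "house", "townhouse", "condominium", "guesthouse"]
# \b anchors keep "house" from matching inside "townhouse" or "guesthouse".
pattern = re.compile(r"\b(" + "|".join(MAIN_TYPES) + r")\b", flags=re.IGNORECASE)


def normalize_property_type(label):
    """Return the main property type found in a label, or None if absent."""
    match = pattern.search(str(label))
    return match.group(1).lower() if match else None


normalized = raw.map(normalize_property_type)
# Labels outside the five main types (here "Boat") are filtered out.
kept = normalized.dropna()
```

Rows whose label matches none of the main types drop out, which mirrors the filtering step described above.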

In [9]:
# Data Loading
listings = pd.read_csv('/mnt/data/public/insideairbnb/data.insideairbnb.com/'
                       'australia/vic/melbourne/2021-07-05/data/'
                       'listings.csv.gz', 
                       compression="gzip")
In [10]:
#Cleaning
df = airbnb.clean_data(listings)

pd.options.mode.chained_assignment = None  # default='warn'
amenities, with_amenities = airbnb.clean_amenities(df)

verif_df, expanded_df = airbnb.clean_verifications(with_amenities)

Missing Data

Data was explored to determine whether any columns have missing values. Table 2 shows the summary of missing values per column. For the purpose of this study, we set a threshold of 25% for missing values. All columns with 25% or more missing values were removed. The resulting summary of missing values after removal of these columns is shown in Table 3.
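The missing-value summary and the threshold rule can be sketched with plain pandas as follows. The toy frame and the names `summary`, `to_drop`, and `THRESHOLD` are illustrative assumptions; the notebook itself relies on the helpers `airbnb.missing` and `airbnb.drop_missing`, whose internals are not shown.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are made up).
toy = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan],
    "review_scores_value": [4.5, np.nan, 4.8, 4.9],
    "bedrooms": [1.0, 2.0, 3.0, 1.0],
})

THRESHOLD = 25.0  # percent; columns at or above this value are dropped

# Count and percentage of missing values per column, largest first.
summary = pd.DataFrame({
    "number of missing values": toy.isna().sum(),
    "% missing": (toy.isna().mean() * 100).round(1),
}).sort_values("% missing", ascending=False)

# Keep only the columns below the threshold.
to_drop = summary.index[summary["% missing"] >= THRESHOLD]
reduced = toy.drop(columns=to_drop)
```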

In [11]:
airbnb.table_caption('Missing Values per Column', 2)
Table 2. Missing Values per Column.
In [12]:
# Check percent missing per column
missing = airbnb.missing(df)
missing.head(15)
Out[12]:
number of missing values % missing
calendar_updated 16959 100.0
bathrooms 16959 100.0
license 16959 100.0
review_scores_value 4249 25.0
review_scores_checkin 4249 25.0
review_scores_location 4249 25.0
review_scores_accuracy 4245 25.0
review_scores_communication 4243 25.0
review_scores_cleanliness 4242 25.0
review_scores_rating 3876 23.0
reviews_per_month 3876 23.0
bedrooms 718 4.0
beds 219 1.0
bath_type 19 0.0
number_of_baths 19 0.0
In [13]:
# Drop columns with >= 25% missing 
clean_df = airbnb.drop_missing(df, missing)
In [14]:
airbnb.table_caption('Missing Values per Column after Dropping Columns', 3)
Table 3. Missing Values per Column after Dropping Columns.
In [15]:
new_missing = airbnb.missing(clean_df)
new_missing.head(10)
Out[15]:
number of missing values % missing
reviews_per_month 3876 23.0
review_scores_rating 3876 23.0
bedrooms 718 4.0
beds 219 1.0
bath_type 19 0.0
number_of_baths 19 0.0
host_listings_count 7 0.0
host_is_superhost 7 0.0
availability_90 0 0.0
calculated_host_listings_count_shared_rooms 0 0.0

Imputation

To handle the remaining missing values, imputation was done. For categorical variables, missing values were filled with the most frequent value in that column. For numerical variables, missing values were replaced with the column median. Table 4 shows the working dataset after cleaning and imputation. Only 29 columns remained after cleaning.
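A minimal sketch of this imputation rule, assuming a small made-up frame, is given below. It is not the body of `airbnb.impute`; it only shows the mode-for-categorical, median-for-numerical idea.

```python
import numpy as np
import pandas as pd

# Toy data with one numerical and one categorical column (values are made up).
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0, 2.0],
    "bath_type": ["private bath", "shared bath", None, "private bath"],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numerical column: fill with the median of the observed values.
        fill = imputed[col].median()
    else:
        # Categorical column: fill with the most frequent observed value.
        fill = imputed[col].mode().iloc[0]
    imputed[col] = imputed[col].fillna(fill)
```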

In [16]:
airbnb.table_caption('Working Dataset after Cleaning and Imputation', 4)
Table 4. Working Dataset after Cleaning and Imputation.
In [17]:
final_df = airbnb.impute(clean_df, new_missing)
final_df.head(5)
Out[17]:
id host_is_superhost host_listings_count host_verifications neighbourhood_cleansed property_type room_type accommodates bedrooms beds ... number_of_reviews number_of_reviews_ltm review_scores_rating calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month number_of_baths bath_type
0 9835 f 1.0 ['email', 'phone', 'reviews'] Manningham house Private room 2 1.0 2.0 ... 4 0 4.50 1 0 1 0 0.03 1.0 private bath
1 10803 f 1.0 ['email', 'phone', 'reviews', 'jumio', 'govern... Moreland apartment Private room 2 1.0 2.0 ... 145 0 4.43 1 0 1 0 2.26 1.0 shared bath
2 12936 f 13.0 ['email', 'phone', 'google', 'reviews', 'jumio... Port Phillip apartment Entire home/apt 2 1.0 1.0 ... 42 0 4.68 10 10 0 0 0.76 1.0 private bath
3 33111 f 1.0 ['email', 'phone', 'reviews'] Melbourne apartment Private room 2 1.0 1.0 ... 2 0 4.50 1 0 1 0 0.02 2.5 baths
4 38271 t 1.0 ['email', 'phone', 'manual_online', 'reviews',... Casey apartment Entire home/apt 5 3.0 3.0 ... 161 12 4.83 1 1 0 0 1.25 1.0 private bath

5 rows × 29 columns

In [18]:
final_df.describe()
Out[18]:
id host_listings_count accommodates bedrooms beds price minimum_nights maximum_nights availability_30 availability_60 ... availability_365 number_of_reviews number_of_reviews_ltm review_scores_rating calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month number_of_baths
count 1.695900e+04 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 ... 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000 16959.000000
mean 2.802741e+07 10.816145 3.367239 1.600743 1.927708 164.065216 6.169055 763.556755 12.026476 26.434106 ... 135.043340 25.063801 3.699511 4.608463 8.156082 6.304912 1.696857 0.075535 1.008288 1.323191
std 1.394427e+07 33.670928 2.227095 0.924934 1.517313 380.300324 32.879693 5677.885134 12.849747 26.284007 ... 141.315201 49.569839 9.099607 0.810892 19.041949 14.535876 10.610108 0.668846 1.736500 0.606754
min 9.835000e+03 0.000000 1.000000 1.000000 0.000000 0.000000 1.000000 1.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.010000 0.000000
25% 1.701644e+07 1.000000 2.000000 1.000000 1.000000 69.000000 1.000000 60.000000 0.000000 0.000000 ... 0.000000 1.000000 0.000000 4.670000 1.000000 0.000000 0.000000 0.000000 0.190000 1.000000
50% 2.867896e+07 1.000000 2.000000 1.000000 1.000000 109.000000 2.000000 1125.000000 5.000000 23.000000 ... 89.000000 5.000000 0.000000 4.810000 1.000000 1.000000 0.000000 0.000000 0.520000 1.000000
75% 4.045636e+07 4.000000 4.000000 2.000000 2.000000 170.000000 3.000000 1125.000000 27.000000 57.000000 ... 278.000000 26.000000 3.000000 4.930000 4.000000 3.000000 1.000000 0.000000 1.110000 1.500000
max 5.084254e+07 408.000000 16.000000 16.000000 32.000000 16267.000000 1125.000000 730365.000000 30.000000 60.000000 ... 365.000000 625.000000 228.000000 5.000000 144.000000 79.000000 115.000000 12.000000 57.190000 9.500000

8 rows × 21 columns

Data Exploration

To gain some insights about AirBnB listings in Melbourne, data such as number of listings, location, price, availability, and even information about the host were examined.

Listings

Number of listings

Neighbourhood

As of July 2021, 30% of listings are located in the City of Melbourne (Figure 2). The City of Melbourne is the municipality at the centre of Melbourne in Victoria, Australia. Melbourne, Australia's second largest city, is the capital of Victoria, and this is where the Victorian government is located. Headquarters of many companies, government agencies, and non-government agencies are also found here (About Melbourne - City of Melbourne, n.d.). For these reasons, the City of Melbourne is a very strategic location for AirBnB listings.

In [28]:
airbnb.plot_neighborhood_count(final_df)
In [29]:
airbnb.fig_caption('Number of listings in Different Neighbourhoods '
                   'in Melbourne', 2)

Figure 2. Number of listings in Different Neighbourhoods in Melbourne.

Property Type

Apartments and houses are the most listed property types in Melbourne (Figure 3). One possible reason why apartments are an attractive investment for property listing sites like AirBnB is the number and diversity of amenities that can be offered. Good-quality apartments have shared areas and features such as gyms, pools, and parking spaces. This is advantageous for investors because they do not have to pay for separate amenities per listing. For travellers or renters, it is a great opportunity to socialize and meet new people.

In [30]:
airbnb.plot_property_count(final_df)
In [31]:
airbnb.fig_caption('Number of listings based on Property Type', 3)

Figure 3. Number of listings based on Property Type.

Room Type

Most of the listings are entire homes/apartments (10,830 listings), and only 0.02% are shared or hotel rooms (Figure 4). AirBnB has a "long-term rental" feature wherein hosts can rent out their place for at least twenty-eight (28) days (Airbnb Launched Long-Term Rentals, 2020). This is a good opportunity to increase the occupancy rate of a listing, especially during non-peak periods for travellers. For the lessee, booking an AirBnB property instead of a traditional lease may be even more affordable. Entire homes or apartments are the most appropriate listings for long-term leases.

In [32]:
airbnb.plot_room_count(final_df)
In [33]:
airbnb.fig_caption('Number of listings based on Room Type', 4)

Figure 4. Number of listings based on Room Type.

Prices

Per Neighbourhood

Figure 5 shows that Yarra Ranges has the highest average price at 318 AUD per night. This could be because Yarra Ranges has a scenic national park, which makes it a hotspot for vacationing tourists. In addition, it houses community and arts centers that may be of interest to tourists (Yarra Ranges Council, n.d.).

On the other hand, Brimbank has the cheapest average price of 96 AUD per night (Figure 6).
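The neighbourhood ranking behind these figures amounts to a group-by average. The sketch below uses made-up neighbourhood labels and prices and is only an assumption about what a helper like `airbnb.plot_avg_price_neighbour` aggregates before plotting.

```python
import pandas as pd

# Made-up prices for three placeholder neighbourhoods.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "B", "B", "C"],
    "price": [300.0, 340.0, 90.0, 100.0, 150.0],
})

# Average nightly price per neighbourhood, most expensive first.
avg_price = (
    toy.groupby("neighbourhood_cleansed")["price"]
    .mean()
    .sort_values(ascending=False)
)

most_expensive = avg_price.index[0]
cheapest = avg_price.index[-1]
```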

In [34]:
airbnb.plot_avg_price_neighbour(final_df, bracket='top')
In [35]:
airbnb.fig_caption('Listings with the Highest Average Price', 5)

Figure 5. Listings with the Highest Average Price.

In [36]:
airbnb.plot_avg_price_neighbour(final_df, bracket='low')
In [37]:
airbnb.fig_caption('Listings with the Lowest Average Price', 6)

Figure 6. Listings with the Lowest Average Price.

Property and Room Type

Apartments and houses have a large price range compared to others (Figure 7). This can be attributed to diversity in number and type of amenities that can be offered as well as the size of properties listed. On the other hand, the more expensive room types are entire home/apartment and private rooms (Figure 8).

In [38]:
airbnb.plot_prop_price(final_df)
In [39]:
airbnb.fig_caption('Box Plots of Prices for Different Property Types', 7)

Figure 7. Box Plots of Prices for Different Property Types.

In [40]:
airbnb.plot_room_price(final_df)
In [41]:
airbnb.fig_caption('Box Plots of Prices for Different Room Types', 8)

Figure 8. Box Plots of Prices for Different Room Types.

Amenities

Parking is the most common amenity found in the listings in Melbourne (Figure 9). Other top amenities include TV, kitchen, essentials, wifi, washer, heating, hangers, and air conditioning. These top amenities are important features of a property that accommodates long-term stays, which suggests that most of the listings are equipped for longer rental terms.

In [42]:
airbnb.plot_amenities(amenities)
In [43]:
airbnb.fig_caption('Frequency of Amenities Found in Listings in Melbourne', 9)

Figure 9. Frequency of Amenities Found in Listings in Melbourne.

Availability

A rental business should be profitable. It is therefore necessary to look at the occupancy of existing listings in Melbourne. Figure 10 shows that, on average, listings in Yarra have the lowest availability for the next year. Yarra may be a good neighbourhood to consider when planning to put up a property on AirBnB. On the other hand, properties in Melton generally have the highest number of days available over the next 365 days.

In [44]:
airbnb.plot_availability(final_df)
In [45]:
airbnb.fig_caption('Availability (Days) of Listings '
                   'in the Next 30, 60, 90, and 365 '
                   'days', 10)

Figure 10. Availability (Days) of Listings in the Next 30, 60, 90, and 365 days.

Host

Verifications

For people who are interested in listing properties on AirBnB, one important factor is the ease of doing business. There are many ways by which hosts are verified. Over 90% of the hosts on AirBnB use their email as a means of verification on the platform (Figure 11).

In [46]:
airbnb.plot_verifications(verif_df)
In [47]:
airbnb.fig_caption('Means of Verification of Hosts of Listings in Melbourne', 
                   11)

Figure 11. Means of Verification of Hosts of Listings in Melbourne.

Listings per Host

Another important factor to look at is the number of listings per host. This may shed light on how many listings one host can manage without sacrificing quality. The distribution is heavily right-skewed: the median host_listings_count across listings is 1, while the mean is about 10.8, so a minority of hosts manage many listings (Figure 12). One host has 408 listings. This host could possibly be a hotel, condominium, or apartment owner who rents out individual rooms of their property.

In [48]:
airbnb.listing_per_host(final_df)
In [49]:
airbnb.fig_caption('Box Plot of the Number of Listings per Host', 12)

Figure 12. Box Plot of the Number of Listings per Host.

Reviews

Lastly, it is important to consider the general review scores of listings in Melbourne. AirBnB listings in Melbourne generally have high review scores, with the median at 4.275 (Figure 13). This indicates that customers are generally satisfied with the quality of the property and their experience when they booked it. There are many property management companies in Melbourne whose main function is to manage the property on behalf of the host. They know the current trends in short-term rentals, including what tourists or lessees usually look for. This may be one of the factors behind such high ratings.

In [50]:
airbnb.rating_host(final_df)
In [51]:
airbnb.fig_caption('Box Plot of the Review Score Ratings of '
                   'Listings in Melbourne', 13)

Figure 13. Box Plot of the Review Score Ratings of Listings in Melbourne.

Feature Engineering

After data cleaning, feature engineering was done. A binary bag-of-words representation was created for list-based columns such as amenities and host verifications. The resulting dataset after one-hot encoding of categorical variables and binary bag-of-words representation is shown in Table 5.
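One way to build such a binary representation for a list-based column is sketched below with scikit-learn's `MultiLabelBinarizer`. The toy rows are made up, and this is an assumed equivalent of what `airbnb.ohe_verifications` produces, not its actual code.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows; each cell stores a Python-style list as a string (values made up).
toy = pd.DataFrame({
    "id": [1, 2],
    "host_verifications": [
        "['email', 'phone', 'reviews']",
        "['email', 'phone', 'jumio']",
    ],
})

# Parse the string form of each list into a real Python list.
parsed = toy["host_verifications"].apply(ast.literal_eval)

# One 0/1 column per distinct verification channel.
mlb = MultiLabelBinarizer()
binary = pd.DataFrame(
    mlb.fit_transform(parsed),
    columns=[f"host_verifications_{name}" for name in mlb.classes_],
    index=toy.index,
)

# Replace the original list column with its binary expansion.
encoded = pd.concat([toy.drop(columns="host_verifications"), binary], axis=1)
```

Each distinct list entry becomes its own 0/1 column, which is why the feature count grows to a few hundred after encoding.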

In [37]:
airbnb.table_caption('Dataset after One-hot Encoding of Categorical '
                     'Variables and Binary Representation of Amenities '
                     'and Host Verifications', 5)
Table 5. Dataset after One-hot Encoding of Categorical Variables and Binary Representation of Amenities and Host Verifications.
In [38]:
# One-hot encoding of all categorical variables 

ohe_df = airbnb.ohe(final_df)
ohe_amenities = airbnb.ohe_amenities(ohe_df, amenities)
ohe_all = airbnb.ohe_verifications(ohe_amenities, verif_df)
ohe_all.head()
Out[38]:
id host_listings_count accommodates bedrooms beds price minimum_nights maximum_nights availability_30 availability_60 ... host_verifications_offline_government_id host_verifications_phone host_verifications_reviews host_verifications_selfie host_verifications_sent_id host_verifications_sesame host_verifications_sesame_offline host_verifications_weibo host_verifications_work_email host_verifications_zhima_selfie
0 9835 1.0 2 1.0 2.0 60.0 1 365 30 60 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 10803 1.0 2 1.0 2.0 28.0 4 14 0 0 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 12936 13.0 2 1.0 1.0 95.0 3 14 0 0 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
3 33111 1.0 2 1.0 1.0 1000.0 1 730 30 60 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 38271 1.0 5 3.0 3.0 101.0 1 14 11 39 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

5 rows × 268 columns

Feature Scaling

Finally, min-max scaling was done. There were a total of 267 features to analyze.
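Min-max scaling maps each feature to the range [0, 1] via x' = (x - min) / (max - min). A minimal sketch with scikit-learn's `MinMaxScaler` follows; the toy values are made up, and setting the `id` column aside before scaling is an assumption based on `airbnb.scaling` returning the ids separately.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame with an identifier column and two features (values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "price": [60.0, 95.0, 1000.0],
    "accommodates": [2, 2, 5],
})

# Keep the identifier aside so it is not treated as a feature.
ids = toy["id"]
features = toy.drop(columns="id")

# Rescale every feature column to the [0, 1] range.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(features),
    columns=features.columns,
    index=features.index,
)
```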

In [39]:
scaled_df, df_id = airbnb.scaling(ohe_all)

Results and Discussion

Dimensionality Reduction

Truncated SVD was performed for dimensionality reduction. This specific technique was chosen since the dataset is a mixture of dense and sparse data.

In [40]:
X_new, exp_var, sv_comp = airbnb.truncated_svd(scaled_df)

Figure 14 below shows the variance explained by each SV as well as the cumulative variance. If we choose 90% as threshold for the explained variance, we will reduce the number of features from 267 to 75 SVs.
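The cutoff selection can be sketched as follows with scikit-learn's `TruncatedSVD` on synthetic data. The matrix `X`, the variable `n_keep`, and the choice of `n_components` are illustrative assumptions and are not taken from `airbnb.truncated_svd` or `airbnb.plot_variance`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled feature matrix: rank-3 signal plus noise.
low_rank = rng.random((200, 3)) @ rng.random((3, 20))
X = low_rank + 0.01 * rng.random((200, 20))

# Keeping n_components below the feature count works with either solver.
svd = TruncatedSVD(n_components=X.shape[1] - 1, random_state=0)
X_reduced = svd.fit_transform(X)

# Running total of the variance explained by SV1, SV2, ...
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of SVs whose cumulative explained variance reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
X_kept = X_reduced[:, :n_keep]
```

Slicing the transformed matrix to the first `n_keep` columns is analogous to the `X_new[:, :75]` slice used later for K-means.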

In [41]:
sv_cutoff = airbnb.plot_variance(exp_var)
In [42]:
airbnb.fig_caption('Individual and Cumulative Variance of the SVs', 14)

Figure 14. Individual and Cumulative Variance of the SVs.

In [43]:
df_sv = airbnb.sv_comp_df(scaled_df, ohe_all, sv_comp)
In [49]:
airbnb.plot_SVcomponents(df_sv, sv_comp)
In [44]:
airbnb.fig_caption('Weights of the Top Features for SV1, SV2, '
                   'SV3, and SV4', 15)

Figure 15. Weights of the Top Features for SV1, SV2, SV3, and SV4.

Weights of the top features for the first four SVs are shown in Figure 15. SV1 explains about 61.5% of the variance observed in AirBnB listings in Melbourne. Features related to host verifications (host_verifications_phone, host_verifications_email), availability (has_availability_t), amenities (amenities_kitchen, amenities_essentials, amenities_smoke_alarm, amenities_wifi, amenities_washer, amenities_heating, amenities_hangers, amenities_long term stays allowed, amenities_tv, amenities_iron, and amenities_airconditioning), and review_scores_rating contribute the most to this SV. It seems that this SV captures features that are related to long-term stays of renters.

SV2 explains about 4.3% of the variation in the listings. For SV2, the top features in terms of weight are mostly related to amenities: amenities_refrigerator, amenities_oven, amenities_microwave, amenities_stove, amenities_dishes silverware, amenities_cooking basics, amenities_bed linens, amenities_dishwasher, amenities_patio or balcony, amenities_extra pillows blankets, amenities_hotwater, and amenities_coffee maker. The features that mostly contribute to SV2 are amenities needed for cooking.

About 1.9% of the variation in the listings can be explained by SV3. Mostly features that are intrinsic to the listings, such as room type, property type, bath type, as well as the neighborhood in which the listings are located contributed most to SV3. Lastly, SV4 can be explained by features which are mostly about future availability of the listings as well as those related to host verifications.

In general, the features that contribute most to the diversity of listings in Melbourne are amenities.
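Reading off the top features of an SV comes down to ranking the original features by the absolute value of their weight in that singular vector (in scikit-learn, the rows of `components_`, which has shape `(n_components, n_features)`). The sketch below uses made-up weights and a hypothetical helper `top_features`.

```python
import numpy as np
import pandas as pd

feature_names = ["amenities_kitchen", "amenities_wifi", "price", "beds"]
# Rows are singular vectors, columns are original features (made-up weights).
components = np.array([
    [0.70, 0.65, 0.05, 0.10],
    [-0.10, 0.20, 0.90, -0.30],
])

# One row per feature, one column per SV.
weights = pd.DataFrame(components.T, index=feature_names, columns=["SV1", "SV2"])


def top_features(sv, n=2):
    """Return the n features with the largest absolute weight on one SV."""
    order = weights[sv].abs().sort_values(ascending=False).index[:n]
    return weights.loc[order, sv]


top_sv1 = top_features("SV1")
top_sv2 = top_features("SV2")
```

Ranking by absolute value keeps strongly negative weights in view, since they separate listings along an SV just as much as positive ones.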

Pairwise (by SV) scatterplots of listings and projection of features along these SVs are shown in Figures 16 to 18.

In [50]:
airbnb.plot_svd_zoomed(X_new, df_sv, sv_comp, 1, 2)
In [45]:
airbnb.fig_caption('Scatterplot of Listings plotted along SV1 and SV2 (left)'
                   ' and Projection of the Original Features along the '
                   'same SVs (right)', 16)

Figure 16. Scatterplot of Listings plotted along SV1 and SV2 (left) and Projection of the Original Features along the same SVs (right).

In [51]:
airbnb.plot_svd(X_new, df_sv, sv_comp, 2, 3)
In [46]:
airbnb.fig_caption('Scatterplot of Listings plotted along SV2 and SV3 (left)'
                   ' and Projection of the Original Features along the '
                   'same SVs (right)', 17)

Figure 17. Scatterplot of Listings plotted along SV2 and SV3 (left) and Projection of the Original Features along the same SVs (right).

In [52]:
airbnb.plot_svd(X_new, df_sv, sv_comp, 1, 3)
In [47]:
airbnb.fig_caption('Scatterplot of Listings plotted along SV1 and SV3 (left)'
                   ' and Projection of the Original Features along the '
                   'same SVs (right)', 18)

Figure 18. Scatterplot of Listings plotted along SV1 and SV3 (left) and Projection of the Original Features along the same SVs (right).

Clustering

Agglomerative Clustering

To determine the propensity of AirBnB listings in Melbourne to cluster based on their overall features, both hierarchical clustering (agglomerative, using Ward's method) and representative-based clustering (KMeans and KMedians) were performed.
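A minimal sketch of Ward-linkage clustering with a distance cutoff, using SciPy on synthetic points, is shown below. The blobs and thresholds are made up, and the assumption that `airbnb.plot_agglo` and `airbnb.agglo_fcluster` wrap `linkage` and `fcluster` in this way is not confirmed by the notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for the SVD-reduced listings.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Build the merge tree with Ward's minimum-variance criterion.
Z = linkage(X, method="ward")

# Cut the tree at a distance threshold; a lower threshold yields more clusters.
labels_coarse = fcluster(Z, t=5.0, criterion="distance")
labels_fine = fcluster(Z, t=0.5, criterion="distance")
```

Lowering the threshold never reduces the number of clusters under Ward linkage, so comparing several cutoffs traces progressively finer partitions of the same tree.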

In [48]:
model = airbnb.agglo_cluster(X_new)
In [49]:
linkage_matrix = airbnb.plot_agglo(X_new, model)
In [50]:
airbnb.fig_caption('Dendrogram showing the clustering of AirBnB listings'
                   ' in Melbourne', 19)

Figure 19. Dendrogram showing the clustering of AirBnB listings in Melbourne.

In [51]:
y_pred_300 = airbnb.agglo_fcluster(X_new, linkage_matrix, 300)
In [53]:
airbnb.fig_caption('Scatterplot showing the clustering of AirBnB listings'
                   ' in Melbourne at Delta = 300', 20)

Figure 20. Scatterplot showing the clustering of AirBnB listings in Melbourne at Delta = 300.

In [54]:
y_pred_250 = airbnb.agglo_fcluster(X_new, linkage_matrix, 250)
In [55]:
airbnb.fig_caption('Scatterplot showing the clustering of AirBnB listings'
                   ' in Melbourne at Delta = 250', 21)

Figure 21. Scatterplot showing the clustering of AirBnB listings in Melbourne at Delta = 250.

In [56]:
y_pred_150 = airbnb.agglo_fcluster(X_new, linkage_matrix, 150)
In [57]:
airbnb.fig_caption('Scatterplot showing the clustering of AirBnB listings'
                   ' in Melbourne at Delta = 150', 22)

Figure 22. Scatterplot showing the clustering of AirBnB listings in Melbourne at Delta = 150.

K-Means

In [58]:
from sklearn.cluster import KMeans
res_kmeans = airbnb.cluster_range(X_new[:, :75], KMeans(random_state=143), 2, 11)
In [59]:
#Plot SV1 and SV2
airbnb.plot_clusters(X_new[:, :75], res_kmeans['ys'], 0, 1)
In [60]:
airbnb.fig_caption('Scatterplot showing the clustering of AirBnB listings'
                   ' at different values of k', 23)

Figure 23. Scatterplot showing the clustering of AirBnB listings at different values of k.

In [61]:
#Plot SV2 and SV3
airbnb.plot_clusters(X_new[:, :75], res_kmeans['ys'], 1, 2)
In [62]:
airbnb.fig_caption('Scatterplot showing the clustering of AirBnB listings'
                   ' at different values of k', 24)

Figure 24. Scatterplot showing the clustering of AirBnB listings at different values of k.

In [63]:
airbnb.plot_num_clusters(res_kmeans)
In [64]:
airbnb.fig_caption('Plots of Different Internal Validation'
                   ' Criteria', 25)

Figure 25. Plots of Different Internal Validation Criteria.

The scatterplots at different values of k, along both SV1 and SV2 and SV2 and SV3, show that the best clustering is at k=3 (Figures 23 and 24). The points in the same cluster are compact, the clusters are balanced, and the clustering is parsimonious. This is further supported by internal validation criteria such as the sum of squared distances to centroids (inertia), the Calinski-Harabasz index, and the silhouette coefficient (Figure 25).
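The three internal validation criteria can be computed for a range of k with scikit-learn, as in the sketch below. The synthetic blobs, the range of k, and the `scores` dictionary are illustrative assumptions; the notebook's own `airbnb.cluster_range` helper may organize its results differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

rng = np.random.default_rng(0)
# Three synthetic, well-separated blobs standing in for the reduced listings.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=143).fit(X)
    scores[k] = {
        # Sum of squared distances to the nearest centroid (lower is tighter).
        "inertia": km.inertia_,
        # Ratio of between- to within-cluster dispersion (higher is better).
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),
        # Mean silhouette coefficient in [-1, 1] (higher is better).
        "silhouette": silhouette_score(X, km.labels_),
    }

# Pick the k with the highest silhouette coefficient.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
```

Inertia always falls as k grows, so it is usually read for an elbow rather than a minimum, while the other two criteria can be compared directly across k.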

In [65]:
# Data with cluster labels
labelled_df = airbnb.df_with_labels(df_id, scaled_df, y_pred_150, y_pred_250, y_pred_300)
features_with_clusters = labelled_df.merge(final_df, how='left',
                                          left_on='id', right_on='id')
features = ['id','neighbourhood_cleansed', 'property_type', 'room_type',
           'amenities', 'price_x', 'cluster_kmeans', 'cluster_agg_d150',
           'cluster_agg_d250', 'cluster_agg_d300']
clusters_df = features_with_clusters[features]
clusters_df.head()
Out[65]:
id neighbourhood_cleansed property_type room_type amenities price_x cluster_kmeans cluster_agg_d150 cluster_agg_d250 cluster_agg_d300
0 9835 Manningham house Private room wifi, long term stays allowed 60.0 1 4 1 1
1 10803 Moreland apartment Private room dedicated workspace, microwave, hot water, iro... 28.0 2 7 4 2
2 12936 Port Phillip apartment Entire home/apt dedicated workspace, microwave, hot water, iro... 95.0 2 10 4 2
3 33111 Melbourne apartment Private room cable tv, washer, long term stays allowed, bre... 1000.0 1 1 1 1
4 38271 Casey apartment Entire home/apt dedicated workspace, microwave, hot water, iro... 101.0 2 6 3 2

Analysis of Clusters: KMeans and Agglomerative Clustering

The distributions of listing prices per cluster for KMeans and Agglomerative Clustering were compared (Figure 26). The 3-cluster model from KMeans Clustering does not differentiate listings by price. The 10-cluster model from Agglomerative Clustering, on the other hand, shows some differences in average listing price. Table 6 shows the average price per cluster for the 10-cluster model: cluster 1 has the lowest average price (92.68 AUD) and cluster 3 has the highest (212.73 AUD).

Other features such as location (neighborhood), property type, and room type are represented almost identically across the clusters, which means they cannot distinguish one cluster from another.
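One way to check whether a categorical feature separates clusters is to compare its within-cluster shares against its overall shares. The sketch below does this with `pd.crosstab` on a small hypothetical frame that only mimics two columns of `clusters_df`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring two columns of clusters_df.
toy = pd.DataFrame({
    "cluster_agg_d150": [1, 1, 1, 1, 2, 2, 2, 2],
    "room_type": ["Private room", "Entire home/apt",
                  "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room"],
})

# Row-normalized shares: each row sums to 1 and gives the
# room-type mix inside one cluster.
room_mix = pd.crosstab(toy["cluster_agg_d150"], toy["room_type"],
                       normalize="index")

# Overall room-type shares across all listings.
overall = toy["room_type"].value_counts(normalize=True)

# Largest absolute gap between any cluster's share and the overall share.
# A gap near zero means the feature does not discriminate among clusters.
max_gap = (room_mix - overall).abs().max().max()
```

In this toy case every cluster has the same room-type mix as the whole data set, so `max_gap` is zero, which is the pattern the analysis above describes.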

We also ran KMedians clustering on the listings, but no clear clusters formed at any value of k, so we omitted it from the results and discussion.

From this point forward, analysis of the amenities will be done based on the 10-cluster model from Agglomerative clustering.
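The column names `cluster_agg_d150`, `cluster_agg_d250`, and `cluster_agg_d300` suggest that the agglomerative labels come from cutting a dendrogram at different distance thresholds, although the helper that produced them is not shown in this section. The sketch below illustrates that idea on synthetic data, assuming Ward linkage and SciPy's `fcluster`; the linkage method, thresholds, and data are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Three tight synthetic groups (hypothetical data, not the listings).
centers = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 50.0]])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(40, 2)) for c in centers])

# Build the merge tree once (Ward linkage is an assumption here).
Z = linkage(X_demo, method="ward")

# Cutting the same tree at a low distance keeps more clusters;
# cutting it at a high distance merges everything together.
labels_low = fcluster(Z, t=50, criterion="distance")
labels_high = fcluster(Z, t=1000, criterion="distance")

n_clusters_low = len(set(labels_low))
n_clusters_high = len(set(labels_high))
```

This matches the pattern in the labelled data above, where the lowest threshold yields the most clusters. Note that `fcluster` numbers its clusters from 1.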

In [73]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(2, 1, figsize=(15, 25))
for n, i in enumerate(['cluster_agg_d150', 'cluster_kmeans']):
    sns.boxplot(x=clusters_df[i], y=clusters_df['price_x'], ax=ax[n], showfliers=False)
In [66]:
airbnb.fig_caption('Boxplots showing the distribution of listing prices '
                   'in Different Clusters Using Agglomerative Clustering '
                   '(Upper) and KMeans Clustering (Lower)', 26)

Figure 26. Boxplots showing the distribution of listing prices in Different Clusters Using Agglomerative Clustering (Upper) and KMeans Clustering (Lower).

In [67]:
airbnb.table_caption('Average Price of AirBnB Listings per Cluster', 6)
Table 6. Average Price of AirBnB Listings per Cluster.
In [127]:
price_per_cluster = pd.DataFrame(clusters_df.groupby('cluster_agg_d150')['price_x'].mean())
price_per_cluster = price_per_cluster.rename(columns = {'price_x': 'price ($)'})
price_per_cluster
Out[127]:
price ($)
cluster_agg_d150
1 92.680400
2 166.747126
3 212.726016
4 195.720729
5 185.518519
6 170.621195
7 104.265776
8 149.505362
9 174.730826
10 203.355769
In [102]:
cols = list(range(1,80))

amenities_only = pd.concat([with_amenities['id'], with_amenities[cols]], axis=1)
cluster_amenities = clusters_df.merge(amenities_only, how='left',
                                     left_on='id', right_on='id')
cols_to_drop = ['neighbourhood_cleansed', 'property_type', 'room_type', 
                'amenities', 'price_x', 'cluster_kmeans',
                'cluster_agg_d250', 'cluster_agg_d300']
cluster_amenities = cluster_amenities.drop(cols_to_drop, axis=1)

Analysis of the top 10 amenities per cluster revealed that the clusters in the 10-cluster agglomerative model share more or less the same amenities: wifi, kitchen, essentials, parking, washer, heating, long term stays allowed, tv, hangers, and airconditioning. The clusters differ only minimally, in the ranking of their most common amenities. Tables showing the most frequent amenities for selected clusters are shown below.

Overall, our clustering analysis showed that price is the main feature separating the listings into clusters.

In [108]:
airbnb.table_caption('Most Common Amenities in Cluster 1', 7)
Table 7. Most Common Amenities in Cluster 1.
In [103]:
df1 = cluster_amenities[cluster_amenities['cluster_agg_d150']==1]

amenities_count = pd.DataFrame(df1.stack().value_counts())
top_amenities = amenities_count.head(11).reset_index().rename(columns={'index':'amenities',
                                                                      0:'count'})
top_amenities.iloc[1:]
Out[103]:
amenities count
1 wifi 2305
2 kitchen 2278
3 essentials 2214
4 parking 2206
5 washer 2062
6 heating 2030
7 long term stays allowed 2016
8 tv 1927
9 hangers 1898
10 airconditioning 1627
In [109]:
airbnb.table_caption('Most Common Amenities in Cluster 4', 8)
Table 8. Most Common Amenities in Cluster 4.
In [106]:
df4 = cluster_amenities[cluster_amenities['cluster_agg_d150']==4]

amenities_count = pd.DataFrame(df4.stack().value_counts())
top_amenities = amenities_count.head(11).reset_index().rename(columns={'index':'amenities',
                                                                      0:'count'})
top_amenities = top_amenities.drop(2)
top_amenities
Out[106]:
amenities count
0 parking 3498
1 tv 3417
3 kitchen 3177
4 essentials 3111
5 wifi 3075
6 heating 3073
7 washer 2997
8 long term stays allowed 2877
9 airconditioning 2826
10 hangers 2803
In [110]:
airbnb.table_caption('Most Common Amenities in Cluster 9', 9)
Table 9. Most Common Amenities in Cluster 9.
In [99]:
df9 = cluster_amenities[cluster_amenities['cluster_agg_d150']==9]

amenities_count = pd.DataFrame(df9.stack().value_counts())
top_amenities = amenities_count.head(6).reset_index().rename(columns={'index':'amenities',
                                                                      0:'count'})
top_amenities = top_amenities.drop(1)
top_amenities
Out[99]:
amenities count
0 parking 1386
2 kitchen 1298
3 tv 1280
4 essentials 1278
5 wifi 1236
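The per-cluster tables above are built one cluster at a time. As an alternative, the counts for all clusters can be obtained in a single pass by reshaping to long format and grouping. This is a minimal sketch on a hypothetical frame that mirrors the layout of `cluster_amenities` (amenity names spread over numbered columns); the toy values are made up.

```python
import pandas as pd

# Hypothetical wide frame: one row per listing, amenity names spread over
# numbered columns, with None where a listing has fewer amenities.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "cluster_agg_d150": [1, 1, 2, 2],
    1: ["wifi", "wifi", "parking", "parking"],
    2: ["kitchen", None, "tv", "wifi"],
})

# Long format: one row per (listing, amenity) pair, empty slots dropped.
long_df = (toy.melt(id_vars=["id", "cluster_agg_d150"], value_name="amenity")
              .dropna(subset=["amenity"]))

# Count amenities within each cluster; value_counts sorts each group
# from most to least frequent.
counts = (long_df.groupby("cluster_agg_d150")["amenity"]
                 .value_counts()
                 .rename("count")
                 .reset_index())

# Keep the single most common amenity per cluster (use head(10) for a top 10).
top_per_cluster = counts.groupby("cluster_agg_d150").head(1)
top_lookup = dict(zip(top_per_cluster["cluster_agg_d150"],
                      top_per_cluster["amenity"]))
```

Because only the amenity columns are melted, the cluster label and listing id never enter the counts, so no rows have to be dropped by position afterwards.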

Conclusion

Aside from the United States and European countries, Australia is a destination worth considering for tourists. Short-term rentals like AirBnB are very popular, especially in cities like Melbourne.

Through this exploratory analysis of AirBnB listings in Melbourne, we determined that amenities such as wifi, kitchen, essentials, parking, washer, heating, long term stays allowed, tv, hangers, and airconditioning are the most important features contributing to the variation among listings. These features are somewhat related to long-term stays, which could mean that most hosts are gearing their listings towards renters who stay for longer periods.

Clustering results showed that price is the main distinguishing feature that segregates listings into clusters. All other features, such as neighborhoods, property types, and room types, are well represented in every cluster, which means they cannot discriminate among clusters. Because the clusters differ in price, renters looking for either short- or long-term stays can choose among clusters with the same characteristics and pick the lowest-priced one to maximize their budget.

Recommendations

This exploratory data analysis is only a preliminary step in understanding AirBnB listings. We encountered certain challenges while conducting the study. One of these is the integrity of the data obtained from insideairbnb.com. We noticed some listings with a nightly price of 1 AUD, which seems implausible. The dataset also includes hosts whose total listing count is zero.
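Records like these can be flagged before clustering with a simple rule-based check. The sketch below is only an illustration: the toy data, the column name `host_total_listings_count`, and the 10 AUD cutoff are assumptions, not values used in this study.

```python
import pandas as pd

# Hypothetical listings; column names and values are assumed for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [1.0, 95.0, 60.0, 1000.0],
    "host_total_listings_count": [2, 0, 1, 3],
})

# Assumed cutoff below which a nightly price is treated as implausible.
MIN_PLAUSIBLE_PRICE = 10.0

# Flag rows with an implausibly low price or a host reporting zero listings.
suspicious = ((toy["price"] < MIN_PLAUSIBLE_PRICE)
              | (toy["host_total_listings_count"] == 0))

flagged_ids = toy.loc[suspicious, "id"].tolist()
cleaned = toy.loc[~suspicious]
```

Flagging rather than silently dropping keeps the suspicious rows available for manual review, since some of them may turn out to be valid.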

Possible future work includes extending the analysis to listings in other areas of Australia for a more comprehensive view of the Airbnb market in the country. Another possibility is to obtain the geographical polygons of the neighborhoods in Melbourne to see more clearly which neighborhoods cluster together based on different features. Exploring the subclusters within each cluster could also provide a more granular view of their features. Customers could use this to identify which listings would be the ideal destination for their getaway. Hosts could use it to adjust their listings to reach a specific target market. Lastly, Airbnb could use this information to help its recommender system maximize engagement within its app.

References

  1. About Melbourne - City of Melbourne. (n.d.). City of Melbourne. Retrieved August 27, 2021, from https://www.melbourne.vic.gov.au/about-melbourne/Pages/about-melbourne.aspx
  2. Airbnb launched long-term rentals. (2020, May 4). GuestReady’s Airbnb Hosting Blog. https://www.guestready.com/blog/airbnb-launched-long-term-rentals/
  3. Inside Airbnb. Adding data to the debate. (n.d.). Inside Airbnb. Retrieved August 27, 2021, from http://insideairbnb.com/about.html
  4. Yarra Ranges Council. (n.d.). Yarra Ranges. Retrieved August 27, 2021, from https://www.yarraranges.vic.gov.au/Home